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Abstract 

The latest Graphics Processing Units (GPUs) are reported to reach up 
to 200 billion floating point operations per second (200 Gflops |19p and 
to have price performance of 0.1 cents per M flop. These facts raise great 
interest in the plausibility of extending the GPUs' use to non-graphics 
applications, in particular numerical simulations on structured grids (lat- 
tice). We review previous work on using GPUs for non-graphics applica- 
tions, implement probability-based simulations on the GPU, namely the 
Ising and percolation models, implement vector operation benchmarks 
for the GPU, and flnally compare the CPU's and GPU's performance. A 
general conclusion from the results obtained is that moving computations 
from the CPU to the GPU is feasible, yielding good time and price per- 
formance, for certain lattice computations. Preliminary results also show 
that it is feasible to use them in parallel. 

1 Introduction 

There are several factors that motivated us to investigate the plausibility of 
using programmable GPUs for numerical simulations on structured grids. These 
factors are the GPUs' 

• high flops count, 

• compatible price performance, and 

• better than the CPU rate of performance increase over time. 

To be specific, nVidia's NV30 graphics card is advertised to have a theoretical 
operation count of 200 Gflops/s (see the NV30 preview or Tabled where 
we summarize the results of a GPU-CPU comparison in performing graphics 
related operations). Taking into account the price of the NV30 graphics card 
we get the low 0.1 cents per M flop. Also, the current development tendency 
shows a stable doubling of the GPU's performance every six months, compared 
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to doubling the CPU's performance for every 18 months (a fact vaUd for the 
last 25 years, known as Moore's law). 

Review of the literature on using CPUs for non-graphics applications (sub- 
section shows that scientists report speedups of the GPU compared to the 
CPU for low precision operations. Several of the non-graphics applications that 
make use of the recently available 32-bit floating point arithmetic report perfor- 
mance that is comparable to the CPU's performance. Thus, advertisements or 
scientific papers that report speedups of one or above one order of magnitude 
in favor of the GPU are restricted to low precision operations. For example, the 
advertised 200 Gflops/s in can not be achieved for 32-bit floating point op- 
erations (Figure^ Right gives the NV30 graphics card specifications). Such ad- 
vertisements and reports about "extremely" high performances, combined with 
vagueness about the precision of the computations motivated us to develop sim- 
ple GPU benchmarking software. Benchmarking the NV30 fragment processors 
with simple vector operations showed performance of 7 Gflops/s which is 44% 
of the theoretically possible maximal performance. This is the speed that we 
also achieved on two probability-based applications that we implemented on the 
GPU. 



Problem size 


Frames per second using 


OpcnGL (GPU) 


Mesa (CPU) 


11,540 


189 


8 


47, 636 


52 


1.71 


193,556 


13 


0.44 


780, 308 


3.28 


0.12 



Table 1. GPU vs CPU in rendering polygons. The GPU (Quadro2 Pro) is 
approximately 30 times faster than the CPU (Pentium III, 1 GHz) in rendering 
polygonal data of various sizes. 

Probability-based models, discussed in sectional have a wide area of appli- 
cations. They are computationally intensive and lend themselves naturally to 
implementation on CPUs, as we show in the paper. 

1.1 Literature review 

The use of graphics hardware for non-graphics applications is becoming increas- 
ingly popular. A few examples of such uses are matrix-matrix multiplication 
visual simulations of boiling, convection, and reaction-diffusion processes 
0, non- linear diffusion ^7], multigrid solver PEj, Lattice Boltzmann Method 
(LBM) [El, fast Fourier transform (FFT) [Hj, and 3D convolution 0. For a 
more complete list and more information on general purpose computations that 
make use of the GPU see the GPGPU's homepage: 
[http : // www . gpgpu . org/ 

In all the cases the non-graphics computations are expressed in terms of ap- 
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propriate graphics operations. These graphics operations are executed on the 
graphics card and the results are used to interpret the results of the original non- 
graphics computations. Up until recently the output of the graphics operations 
was constrained to integers, which was a serious obstacle for a meaningful use of 
many of the non-graphics algorithms developed for graphics hardware. For ex- 
ample, M. Rumpf and R. Strzodka 17 used 12-bit arithmetic (InfiniteReality2 
graphics card) for nonlinear diffusion problems, E. Larsen and D. McAllister 
PI used 8-bit arithmetic (GeForceS graphics card) for matrix-matrix multipli- 
cations, etc. The low accuracy of the graphics cards computations encouraged 
research in complicated software techniques to increase the accuracy. See for 
example the range scaling and range separation techniques developed in |12| . 
Even when 16-bit floating point precision became available (in GeForce4), it was 
not uniformly provided throughout the graphics pipeline. This was the reason 
why C. Thompson's et al. [201 general-purpose vector operations framework 
was not able to retrieve the computational results without loss of precision. 

Currently, graphics cards like the nVidia Quadro FX 1000, known also as 
NV30, support 32-bit floating point operations throughout the entire graphics 
pipeline. Features like programmable vertex and fragment shaders (see section 
ll.2|l allow software developers to easily modify the graphics pipeline, which is 
used in most of the non-graphics applications. The programmability also be- 
comes easier since more alternatives to assembly programming become available. 
One example is the high level language Cg |15| . 

By porting non-graphics applications to the GPU, scientists try to achieve 
computational speedup or upload some of the computations from the CPU by 
using the GPU as coprocessor. Currently significant speedups of GPU vs CPU 
are reported if the GPU performs low precision computations. Depending on the 
configuration and the low precision used (often 8-bit) scientists report speedups 
in the range of 25 — 30 times, and as high as 60 times. Advertisements that make 
claims such as that GeForce 4 CPUs are capable of 1.2 trillion operations/s or 
that a supercomputer made out of PlayStations is capable of 0.5 trillion oper- 
ations/sec |11| are based on certain types of low precision graphics operations. 
Our experience shows that currently general purpose applications that use 32- 
bit floating point arithmetic (on NV30) have a speed comparable to the CPU's 
(Pentium 4, 2.8 GHz). By comparable we mean that the GPU may be 2 — 6 
times faster for certain applications, but not "outrageously" (order of magni- 
tude) faster as seen in some low precision computations. This was confirmed in 
our applications and benchmarks, the multigrid solver in 1 , the shaders' speeds 
using 32-bit floating point arithmetic from http : //www . cgshaders . or g etc. 

1.2 Programmable graphics cards 

Old graphics cards had a fixed function graphics pipeline, a schematic view of 
which is given on Figure ^ Up. The operations in the dashed rectangle were 
configured on a very low level and were practically impossible to change by soft- 
ware developers. In August, 1999 nVidia released the GeForce 256 graphics card, 
which allowed a certain degree of programmability of its pipeline. In February, 
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2001 nVidia released the GeForceS GPU, which is considered to be the first fully 
programmable GPU. Here fully programmable means that developers were able 
to provide their own transformations and lighting (T & L) operations (vertex 
shaders) to be performed on the vertices (by the Programmable vertex proces- 
sor) and their own pixel shaders to determine the final pixels color (executed 
on the Programmable fragment processor). Both the vertex and pixel shaders 
are small programs, which when enabled, replace the corresponding fixed func- 
tion pipeline. The programs get executed automatically for every vertex/pixel 
and can change their attributes. Originally the vertex and pixel shaders had 
to be written in assembly. The constantly increasing functionality provided by 
the graphics cards allows for more complex shaders to be written. To simplify 
the implementation of such shaders nVidia recently released a high level shader 
language, called Cg ^JEl- Cg stands for "C for graphics" since it has syntax 
similar to C. 

2 Probability-based simulations 

In this section we briefly describe the models that we implemented on the GPU, 
namely the Ising fSection I2.2|l and the percolation fScction \2.[^ models. Both 
methods are considered to be of Monte Carlo type fSection |2.1() . 

2.1 Monte Carlo simulations 

Monte Carlo (MC) methods are used in the simulation of a variety of phenomena 
in physics, finance, chemistry, etc. MC simulations are based on probability 
statistics and use random numbers. The name derives from the famous Monte 
Carlo resort and is associated with roulette as a simple way of generating random 
numbers. 

A classical example of using MC methods is to compute the area of a circle. 
First the circle is circumscribed by a square, then random locations within the 
square are generated, and finally 

circle area locations within the circle 

square area number of generated locations ' 

where the approximation greatly depends on the number of locations generated 
and the "quality" of the random locations generator. Our aim is to use graphics 
cards for the fast generation of extremely large amounts of random numbers, 
and the model processing associated with the random numbers generated. To be 
specific, we would like to achieve performance close to the theoretically possible 
maximal performance, which is 16 Gflops for the fragment processors of the 
NV30 graphics card. 

A problem of general interest is the computation of expected values. For 
example, assume we have a system that can be in any of its states Si, i = 
1, ...,N with known probabilities P{Si). Also, assume a quantity of interest F 
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is computable for any of the states. Then, the expected value for F, denoted 
by E{F), is given by 

N 

(1) 

1=1 

The Ising model, described in Section is a MC method for computing such 
expected values. The difficulty in computing E{F) is when A'' becomes large. 
If we consider a 2D system of particles, say on a 1024 x 1024 lattice, and every 
particle is modeled to have k possible states, then N is of order /c^"^^ . 

The accuracy of the Monte Carlo simulations, as mentioned above, depends 
on the quality of the random number generator. Computer programs generate 
pseudo-random numbers. For the applications that we consider we used a linear 
congruential type generator (LCG). The form is 

R{n) = (a * R{n - 1) + b) mod N. 

Fixing a starting value R(0), called seed, uniquely determines the numbers gen- 
erated. LCGs are fast, easy to compute, and reasonably accurate. Furthermore, 
they lead to uniformly distributed random numbers and are the most frequently 
used. The LCGs are well understood and studied. One has to be careful in the 
choice of the constants and the seed. Undesirable patterns may occur in appli- 
cations that consider n-tuples of numbers generated by LCGs (see jl3|). 

2.2 Ising model 

The Ising model, a simplified model for magnets, was introduced by Wilhelm 
Lenz in 1920 and further developed later by his student Ernst Ising. The model 
is on a lattice, in our case two dimensional. A "spin" , pointing up or down, 
is associated with every cell of the lattice. The spin corresponds to the ori- 
entation of electrons in the magnet's atoms. There are two opposing physics 
principles that are incorporated in the Ising model: (1) minimization of the 
system's energy, achieved by spins pointing in one direction, and (2) entropy 
maximization, or randomness, achieved by spins pointing in different directions. 
The Ising model uses temperature to couple these opposing principles (see an 
illustration on Figure 12). There are many variations of the Ising model and 
its implementation (see |3] and the literature cited there). The computational 
model that we used is described as follows. 

We want to be able to compute various expected values (equation P), such 
as expected magnetization and expected energy. To compute expected magne- 
tization, F{Si) from equation ^ is the magnetization of state Si, defined as 
the number of spins pointing up minus the number of spins pointing down. To 
compute expected energy, F{Si) is the energy of state Si 

= - E s.imik), 

<j,k> 
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where the sum is over all lattice edges with < j,k > being the edge connecting 
nearest neighbor sites j and k, Si{j) is the spin for site j of state Si with values 
1, for spin pointing for example up, or —1, for spin pointing down. 

The idea is not to compute the quantity described in equation ^ exactly, 
since many of the states are of very low probability, but to evolve the system into 
"more probable" states and get the expected value as the average of several such 
"more probable" states. The user inputs an absolute temperature of interest T 
in Kelvin, and probability p € [0, 1] for spins pointing up. Using this probability 
we generate a random initial state with spins pointing up with probability p. 
The procedure for evolving from this initial state into "more probable" states is 
described in the following paragraph. The theoretical justification of methods 
dealing with similar sequences evolving from state to state, based on certain 
probability decisions, is related to the so-called Markov chains i4j. 

The lattice is colored in a checkerboard manner. We define a sweep as a 
pass through all white or all black sites. This is done so that the order in which 
the sites are processed does not matter. We start consecutive black and white 
sweeps. At every site we make a decision whether to flip its spin based on the 
procedure 

1. Denote the present state as 5, and the state with flipped spin at the 
current site as S'. 

2. Compute AE = E(S") - E(S'). 

3. If AE < accept S' as the new state. 

4. If AE > 0, generate a random number B E [0, 1], and accept S' as the 
new state if 

P{S') _ AE/(fcT) 

otherwise the state remains S. In the last formula the probability P{S) 
for a state S is given by the Boltzmann probability distribution function 

-E(S)/(fcT) 
^(^) = 1^ ' 

J2 e-E(S,)/(fcT) 
1=1 

where k is a constant, known as the Boltzmann's constant. 

A standard Monte Carlo implementation of the Ising model would involve a 
random traversing through different sites of the lattice, and flipping or not the 
spin depending on the outcome of the above procedure. The implementation 
that we describe is influenced by one of M. Creutz's [3] simulations of the Ising 
model. He uses the checkerboard sweeps in a fully deterministic spin flipping 
dynamic, which does not require generation of high quality random numbers 
and yields one order of magnitude faster execution than conventional Monte 
Carlo implementations. For proof of concept in benchmarking the GPU we 
use the algorithm described, which involves a richer variety of mathematical 



6 



operations. The checkerboard sweeps are crucial for the efficient use of the 
GPU, since the GPU computations are performed on passes over the entire 
lattice. Simultaneous update on all the sites would not be a simulation of the 
Ising model, as shown in |21j . 

2.3 Percolation model 

The percolation model is a model for studying disordered media. It was first 
studied by Broadbent and HemmercleyP] in 1957. Since then percolation type 
models have been used in the study of various phenomena, such as the spread of 
diseases, flow in porous media, forest fire propagation, phase transitions, cluster- 
ing, etc. The general scenario is when a medium is modeled as an interconnected 
set of vertices (sites). Values modeling the media and the phenomena of interest 
are associated with the sites and the connections between them. Usually these 
values are modeling disordered media with a certain probability distribution, 
and are obtained from random number generators. A point (or points) of "inva- 
sion" , such as the starting point of fire, disease, etc, is given, and based on the 
values in the sites and the connections, a probabilistic action for the invasion is 
taken. Of particular interest is to determine: 

1. Media modeling threshold after which there exists a spanning (often called 
percolating) cluster. This is an interconnected set of invaded sites that 
spans from one end of the medium to another. 

2. Relations between different media models and time to reach steady state 
invasion or percolation cluster. 

Using the basics of the percolation, as described above, one can derive meth- 
ods of substantial complexity. For example, R. Saskin et al. ^] presented a 
novel approach for the clustering of gene expression patterns using percolation: 
gene expressions are modeled as (1) sites containing m measurements each, and 
(2) connections between all pairs of sites. The connections have weights which 
represent the similarity between the gene expressions. Based on the mutual con- 
nectivity of the gene expressions, they come up with a probabilistic percolation 
model to cluster the data. 

Here, for proof of concept in benchmarking the GPU, we implemented a 
simple percolation model with application to diffusion through porous media 
(see Figure The porous media is modeled on a two dimensional lattice. 
First, we specify porosity as a probability P of the site being a pore. Then 
we go through all the sites and at each site (1) generate a random number 
R £ [0, 1], and (2) consider the site "pore" if i? < P, or solid otherwise. Finally, 
we consider an invasion point, initialize a cluster to be the invasion point, and 
start the process of spanning the initial cluster through the porous space. 
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3 Implementation details 



The implementations of the models described are Cg fragment programs, which 
are being invoked (and executed on the GPU) by OpenGL application running 
on the CPU. To do the OpenGL - fragment programs binding we used the 
OpenGL Cg run-time functions described in the Cg toolkit user's manual |15| . 
The programming model is described in the following paragraph. 

The OpenGL application repeatedly executes the iterative procedure: 

• Prepare the input for the GPU pipeline; 

• If the GPU executes a non-fixed function pipeline specify the vertex or 
fragment programs to be used; 

• If needed, read back the graphics pipeline output (or part of it). 

This programming model is standard. It is very often related to the so-called 
Dynamic texturing, where on the first step one creates a texture T, then a GPU 
fragment program uses T in rendering an image in an off-screen buffer (called 
p-buffer), and finally T is updated from the resulting image. 

To be more specific, we have used GL_TEXTURE_RECTANGLE_NV tex- 
tures, an extension to the 2D texture, which allows the texture dimensions not to 
be a power of 2. To initialize such textures from the main memory we use glTex- 
SubImage2D(GL_TEXTURE_RECTANGLE.NV, ... ). To replace (copy) rect- 
angular image results from the p-buffer to GL_TEXTURE_RECTANGLE_NV 
textures we use glCopyTexSubImage2D(GL_TEXTURE.RECTANGLE.NV, ...), 
which is entirely done on the graphics card. We used the floating point p-buffer 
that is distributed with the Cg toolkit which is available on the Cg homepage 
For our fragment programs we used fp30 profile with options set by the 
cgGLSetOptimalOptions function. 

The times in the benchmarking programs were measured with the gettime- 
ofday function. Since the OpenGL calls are asynchronously executed we used 
glFinish() to enforce OpenGL requests completion before measuring their final 
execution time. The time measures were used to determine the CPU's perfor- 
mance. 

4 Benchmarking the GPU 

We wrote simple fragment programs to benchmark the performance of simple 
floating point vector operations on the GPU. The results were compared with 
those on the CPU. The specifications of the system on which we applied the 
benchmark programs are given in Figure Q Down. We perform vector oper- 
ations, where the vectors are represented by 2D textures. Here we included 
computational results for textures of sizes 256 x 256 and 512 x 512 (with corre- 
sponding vector sizes equal to texture sizes times 4). 

Table Ogives the GPU performance for different fioating point vector oper- 
ations. Table 01 is the corresponding CPU performance. 
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operation 


Lattice size (not necessary power of 2) 


256 X 256 


512 X 512 




0.00063 


0.0024 




0.00073 


0.0027 


COS, sin 


0.00089 


0.0034 


log, exp 


0.00109 


0.0039 



Table 2. Time in seconds for different floating point vector operations on the 
GPU (NV30). The vector sizes are lattice size times 4. 



operation 


Lattice size 


256 X 256 


512 X 512 




0.0011 


0.0046 


COS , sin 


0.0540 


0.0650 


log, exp 


0.0609 


0.1100 



Table 3. Time in seconds for different floating point vector operations on the 
CPU (2.8 GHz Pentium 4). The vector sizes are lattice size times 4. 

The performance is measured as explained in section |21 We performed our 
computations and measures on Linux Red Hat 9 operating system. Problems 
related to floating point p-buffers were observed with nVidia driver display 
4191 from December, 2002. These were fixed in the drivers that followed. The 
computations in Tables 01 and 01 and sections and are done using Linux 
driver display 4363 from April, 2003. Recently we installed the newest driver, 
version 4496 from July, 2003, and observed a speedup of approximately 2 times 
compared to the older driver. 

The operations measured are in the following forms a = c (for the = op- 
eration), a+ = c (for the basic arithmetic operations), and a = sin{a) (for 
the sin, cos, log, and exp operations), where a is a vector represented with a 
2D texture and c is a constant. The a = c operation (Table 0J is present in 
all of the operations considered. Its execution time includes overheads related 
to the graphics pipeline and time for transferring data (reading and writing 
to the local GPU memory of four 32-bit floating point values for each lattice 
vertex). Having the time for this operation allows as to exclude it from the 
other operations and get a better idea about their performance. For example, 
adding a basic arithmetic operation to a fragment program will increase the 
execution time on a lattice of size 512 x 512 with 0.0003 seconds. Thus the 
performance would be 512^ * 4/0.0003 « 3.5 Gflops/s. This is the performance 
to be expected for longer fragment programs, where the overhead time would 
get significantly small. Indeed, in section we achieve this performance for the 
Ising model, which has a fragment program of 109 assembly instructions. The 
overhead time is significantly large compared to the other operations, which 
shows that an approach of making a library of different basic vector operations 
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on the GPU would not be efficient. For example the basic arithmetic operations 
on a 512 x 512 lattice would yield a performance of approximately 0.39 Gflops/s 
instead of the much higher 3.5 Gflops/s. 





Lattice size (not necessary power of 2) 


64 X 64 


128 X 128 


256 X 256 


512 X 512 


1024 X 1024 


GPU sec/frame 


0.0006 


0.0023 


0.0081 


0.033 


0.14 


CPU sec/frame 


0.0008 


0.0020 


0.0069 


0.026 


0.10 



Table 4. GPU (NV30) and CPU (2.8 GHz Pentium 4) performances in pro- 
cessing a single step of the Ising model. 

Overheads related to data traffic are also observed on the CPU, but they 
are more difficult to measure. For example the basic operations would take less 
time to compute than a = c (we use full compiler optimization). This suggests 
that the time of the basic operations is a measure for these overheads. Note 
that the cos, sin, log, and exp are significantly faster on the GPU. Part of the 
reason is that they are accelerated on the GPU by reducing the precision of 
their computation. 

The results from the other benchmark problems that we developed, namely 
the Ising and percolation implementations, are given in section[Sl Measures for 
the CPU-CPU communication speed are given in section (BJ Table |H1 The com- 
munications measured use the AGP 8x port, which has a theoretical bandwidth 
of 2.1 GB/s on the NV30 (Figure □ Down). 

5 Performance results and analysis 

We tested our Ising model implementation with results from Michael Creutz's |3] 
implementation. As expected, for lower input temperatures the minimization of 
the system's energy physical principle prevails, which is expressed by clustering 
of spins pointing in one direction (see Figure |21 Left). For higher tempera- 
tures the entropy maximization physical principle prevails, which is expressed 
by randomness in the spin orientations (see Figure Right). 

We tested our percolation model implementation by experimentally confirm- 
ing the well known fact that the percolation threshold, or in our case the porosity 
threshold for fiuid flow, in 2D is 0.592746 (see !S|). Figure 13 gives a snapshot of 
fluid spreading in 2D porous media. 

Comparison between the GPU and CPU performances on the Ising model is 
given in Table 01 The CPU code is compiled with full optimization. The results 
show that the GPU and the CPU have comparable speed for this application, 
with the CPU being slightly faster. With the fast developments in the graphics 
hardware, it is no surprise that optimization opportunities may be missed at 
the software level. Indeed, as mentioned in section ^ updating the display 
driver increased the CPU's performance approximately by a factor of 2. The 
Ising fragment program gets compiled using the Cg compiler (cgc -profile fp30 
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ising.cg) to 109 instructions. Thus the achieved GPU performance, for example 
on a lattice of size 256 x 256, is 256^ * 109 * 4/0.0081 « 3.5 Gflops/s using 
the old display driver and 7 Gflops/s using the latest display driver, which 
is correspondingly 22% and 44% of the theoretically maximal floating point 
performance of the NV30 GPU (theoretically the NV30 can achieve 16 Gflops/s, 
see Figure n] Down). 

One performance drawback in our Ising model implementation is due to 
the fact that currently GPUs, and in particular the NV30, do not support 
branching in the fragment programs (see JS], Appendix C, Step 9). This means 
that a conditional if /else statement would get executed for the time as if 
all the statements were executed, regardless of the condition. In our case we 
implemented the checkerboard execution pattern with conditional statement in 
the fragment program. Modeling the checkerboard pattern with 2 lattices (one 
for black and one for white squares) and properly modifying the code could 
replace the if /else condition in the fragment program and thus increase the 
speed by a factor of 2. Another important consideration related to the GPU's 
performance is that the operations should be organized in terms of 4D vector 
operations. For a full list of considerations for achieving high performance see 
|15|. Appendix C. 

6 Extensions and future work 

Our implementation of the Ising and the percolation models were designed 
mostly to prove a concept and to be used in benchmarking the GPUs float- 
ing point performance. The implementation goal skipped some optimization 
issues that are important in developing GPU software. Currently we are looking 
closely at the assembly code generated by Cg and are working on its optimiza- 
tion. Another direction is to increase the number of applications for the GPUs. 
We are working on Quantum Chromodynamic (QCD) applications and fluid 
flow simulations. Such applications usually lead to large scale computations 
which are impossible to perform on a single GPU (or GPU) due to memory 
constraints. Therefore, a main direction of our research is on the parallel use 
of programmable GPUs. Table gives the different communication rates be- 
tween the GPU and the GPU for lattices of different sizes. Although the results 
are far from the theoretical maximum, AGP 8x graphics bus with bandwidth 
2.1 GB/s, the CPU-GPU communication speed will not be a bottleneck in a 
commodity-based cluster. 

7 Conclusions 

Our analysis and benchmarking were mainly concentrated on the NV30 frag- 
ment processors' floating point operations performance. The results show that 
it is feasible to use GPUs for numerical simulations. We demonstrated this 
by banchmarking the GPU's performance on simple vector operations and by 



11 



Operation 


Lattice size (not necessary power of 2) 


« speed 
(MB/s) 


64 X 64 


128 X 128 


256 X 256 


512 X 512 


Road iKlr 


O.OOOK) 


().()()()2 


().()()()6 


0.0024 


14 


Read all 


0.00040 


0.0015 


0.0062 


0.0250 


167 


Write bdr 


0.00022 


0.0003 


0.0007 


0.0024 


14 


Write all 


0.00020 


0.0008 


0.0032 


0.0120 


350 



Table 5. Communication rates between the CPU and the GPU for lattices 
of different sizes. The reading is done using the glReadPixels function, writing 
the boundary is done using the glDrawPixels function, and writing the whole 
domain is done using the glTexSubImage2D function. 

implementing two probability-based simulations, namely the Ising and the per- 
colation models. For these two applications, the applications cited in the litera- 
ture overview, our vector operations benchmarks, and various nVidia distributed 
shaders we observed that the fragment processors' (nVidia NV30) floating point 
performance is comparable to the performance of a Pentium 4 processor running 
at 2.8 GHz. By comparable we mean that the GPU may be 2 — 6 times faster for 
certain applications. For example, we achieved 44% of the theoretically possible 
maximal performance. Such performance made the GPU's implementation run 
twice faster than the CPU's. A modification of the algorithm that removes the 
conditional if /else statements related to the checkerboard type traversal of 
the 2D lattices would contribute an additional speedup of at least two times. 
These results make the use of the GPU as a coprocessor appealing. Also, CPUs 
tend to have a higher rate of performance increase over time than the CPUs, 
thus making the study of non-graphics applications on the GPU valuable re- 
search for the future. Specdups of approximately one order of magnitude in 
favor of the GPU are observed only in certain applications involving low pre- 
cision computations. A reason for this difference is the larger traffic involved 
in higher precision computations which traffic makes the CPUs' local memory 
bandwidth a computational bottleneck. Finally, we note that the acceleration 
graphics ports provide enough bandwidth for the CPU-GPU communications 
to make the use of parallel GPUs computations feasible. 
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Figure 1: Up: Graphics pipeline. The dashed rectangle marks the fixed func- 
tion graphics pipeline (with T&L standing for transformation and lighting). 
Additional hardware, programmable vertex and fragment processors, available 
in newer graphics cards provide the developers with opportunity to change the 
fixed function pipeline. Down: Our system configuration and CPU/GPU per- 
formance specifications (Pentium 4 GPU and NV30 GPU). 
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Figure 2: Ising model visualization on lattices of size 512 x 512. The red/blue 
regions correspond to spins pointing up/down. Observe the spin clustering (re- 
lated to energy minimization) for low input temperature (Left) and the random- 
ness (related to entropy maximization) for higher input temperatures (Right). 




Figure 3: Percolation model visualization. Seen is a snapshot of fluid spreading 
in 2D porous media (modeled on a lattice of size 512 x 512, porosity 0.6). 
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